feat(codex): per-turn watchdog with TurnWatchdogError + exit 124#312
feat(codex): per-turn watchdog with TurnWatchdogError + exit 124#312OJamals wants to merge 1 commit into
Conversation
Adds an opt-in heartbeat watchdog at the captureTurn layer that aborts a turn when no JSON-RPC notification arrives within a configurable window. Default behavior preserved: opt-in via CODEX_TURN_WATCHDOG_MS env or per-call options.watchdogMs. Problem: runAppServerTurn -> captureTurn awaits state.completion indefinitely. The 250ms scheduleInferredCompletion timer only fires after finalAnswerSeen=true. When the JSON-RPC stream stalls entirely (model inference hang, network drop without TCP error, app-server stuck), no notifications arrive, the timer never fires, and the await hangs forever. --wait controls foreground vs background dispatch only; it is not a runtime bound. Fix: TurnWatchdogError carries code='TURN_WATCHDOG_TIMEOUT' + exitCode=124. createTurnCaptureState accepts watchdogMs. captureTurn arms the watchdog before startRequest, kicks (resets) it on every notification, and disarms in finally. On fire, the watchdog rejects state.completion with TurnWatchdogError, propagating through await + finally to the caller. codex-companion main().catch maps TurnWatchdogError to exit 124 and emits a structured error JSON. Heartbeat semantics: an active turn streaming notifications resets the timer indefinitely. Only true silence triggers abort. Tests: tests/turn-watchdog.test.mjs covers TurnWatchdogError shape (code, exitCode, message, threadId/turnId/watchdogMs metadata, default null fallbacks). Refs: this addresses a stalled-turn class that downstream wrappers (e.g. /smart-review in https://github.com/anthropics/claude-code workflows) currently mask with external alarm timeouts.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 186c9adbf1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| armWatchdog(state); | ||
| const response = await startRequest(); |
There was a problem hiding this comment.
Race the watchdog against the start request
When turn/start itself stops responding after the timer is armed, this only rejects state.completion; captureTurn is still blocked on await startRequest() and never observes that rejection, so the watchdog either leaves the call hung or surfaces as an unhandled rejection instead of returning the structured exit-124 error. This affects the same stalled JSON-RPC connection scenario before the start response is delivered; the timer needs to be raced with startRequest() or otherwise abort the pending request path too.
Useful? React with 👍 / 👎.
| watchdogMs: | ||
| typeof options.watchdogMs === "number" | ||
| ? options.watchdogMs | ||
| : Number(process.env.CODEX_TURN_WATCHDOG_MS) || null |
There was a problem hiding this comment.
Apply the watchdog to native review turns
The env-backed watchdog is only forwarded from runAppServerTurn, but the default review path calls runAppServerReview, which also awaits captureTurn and still passes no watchdogMs. As a result, setting CODEX_TURN_WATCHDOG_MS does not protect native /codex:review runs from the same silent JSON-RPC stall, even though that path uses the same capture layer; forward the option/env value from runAppServerReview as well.
Useful? React with 👍 / 👎.
Problem
runAppServerTurn→captureTurn(plugins/codex/scripts/lib/codex.mjs) awaitsstate.completionindefinitely. Completion resolves onturn/completedJSON-RPC notifications or on the 250msscheduleInferredCompletiontimer (which only fires afterfinalAnswerSeen=true). If the JSON-RPC stream stalls entirely — model inference hang, network drop without TCP error, app-server stuck — no notifications arrive, the timer never fires, and the await hangs forever.--waitcontrols foreground vs background dispatch only; it is not a runtime bound.Downstream wrappers currently work around this by wrapping every Codex review call with an external
timeout(1)/perl alarmshell guard. That works but is fragile (loses structured error info, only available where the wrapper exists).Fix
Per-turn heartbeat watchdog at the
captureTurnlayer:TurnWatchdogError extends Errorwithcode = 'TURN_WATCHDOG_TIMEOUT'andexitCode = 124(matchingtimeout(1)convention).createTurnCaptureStateacceptswatchdogMsand trackswatchdogTimer.captureTurnarms the watchdog on entry, resets it on every notification (heartbeat), disarms infinally.state.rejectCompletion(new TurnWatchdogError(...)), which propagates out ofawait state.completionand throughrunAppServerTurnto the caller.runAppServerTurnreadsoptions.watchdogMs ?? Number(process.env.CODEX_TURN_WATCHDOG_MS) || nulland forwards.codex-companion.mjs main().catchmapsTurnWatchdogError → process.exitCode = 124and writes a structured{error: 'TurnWatchdogTimeout', ...}JSON line to stderr.Heartbeat semantics
An active turn streaming notifications resets the timer indefinitely. Only true silence triggers abort. Default
nullpreserves existing behavior bit-for-bit. Opt-in viaCODEX_TURN_WATCHDOG_MSor per-calloptions.watchdogMs.TurnWatchdogError, exit 124Tests
tests/turn-watchdog.test.mjscoversTurnWatchdogErrorshape: code, exitCode, message, threadId/turnId/watchdogMs metadata, default-null fallbacks. Fullnode --test tests/*.test.mjsrun is green for all suites unaffected by pre-existing darwin tmp-dir flakes (state/runtime tests; same 4 fail on untouched v1.0.4).captureTurnis internal, so the included unit test exercises the exported error class. A behavioral test would require either exportingcaptureTurnor wiring a fake JSON-RPC client throughrunAppServerTurn; happy to extend if maintainers prefer.Files
plugins/codex/scripts/lib/codex.mjs—+77/-1plugins/codex/scripts/codex-companion.mjs—+15/-1tests/turn-watchdog.test.mjs—+31(new)Total:
+122/-2. No public API changes; one new export (TurnWatchdogError).Context
This was discovered while investigating stalled
/codex:reviewruns in a downstream Claude Code integration. Full investigation notes + design doc:executeReviewRunblocked onstate.completionwith no diagnostic.perl alarm 600swrapper.TurnWatchdogErrorinfo propagates to callers.